Attending to Colleg undoubtedly have an impact on people´s lives,
also those people that went to college make an impact of their own
on others persons, communities and workplace.
I want to know if going to college makes a difference in Baseball players,
their teams and their income as professional players.
These are the questions I want to answer with this analysis:
1. On a yearly basis, does Baseball players who attended to college have greater income compared to those who didn't?
2. Having College graduated player gives teams better chance to win World Series?
3. Having more College graduated players makes teams more successful during the season?
Players who attended to college
1. Income of players
2. Chance of winning World Series
3. Performance of the teams
In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
#Loads important files for this analysis
filename = 'baseballdatabank-2017.1\core\Teams.csv'
teams_df = pd.read_csv(filename)
filename = 'baseballdatabank-2017.1\core\Salaries.csv'
salaries_df = pd.read_csv(filename)
filename = 'baseballdatabank-2017.1\core\Master.csv'
master_df = pd.read_csv(filename)
filename = 'baseballdatabank-2017.1\core\CollegePlaying.csv'
college_df = pd.read_csv(filename)
filename = 'baseballdatabank-2017.1\core\Appearances.csv'
appearances_df = pd.read_csv(filename)
In [2]:
#In the master table get those players who went to college or not
went = college_df.playerID.unique()
#append the new column to the original dataFrame
master_df['went_to_college'] = master_df.playerID.isin(went)
In [3]:
#get players who attended to college and those who didn't
college_playing = master_df[master_df['went_to_college']]
non_college_playing = master_df[master_df['went_to_college']==False]
In [4]:
#now get the salaries of the players to describe and compare
#create an index dataFrame of the ssalaries table to splti the table
#no need to append this column to that table
salaries_df['went_to_college'] = salaries_df.playerID.isin(went)
college_player_salaries = salaries_df[salaries_df['went_to_college']]
non_college_player_salaries = salaries_df[salaries_df['went_to_college']==False]
In [5]:
#Salaries on year 2016
college_salaries2016 = college_player_salaries[college_player_salaries.yearID==2016]
non_college_salaries2016 = non_college_player_salaries[non_college_player_salaries.yearID==2016]
#Salaries on year 2015
college_salaries2015 = college_player_salaries[college_player_salaries.yearID==2015]
non_college_salaries2015 = non_college_player_salaries[non_college_player_salaries.yearID==2015]
#Salaries on year 2014
college_salaries2014 = college_player_salaries[college_player_salaries.yearID==2014]
non_college_salaries2014 = non_college_player_salaries[non_college_player_salaries.yearID==2014]
In [6]:
def put_salaries_hist(row, col, df, axes, lbl,year):
"""
Draws an histogram whit a line on the average, to depict the distribution
of the salaries data, and prints lables and legends.
"""
titlestr = '{}College Players, year {}'
xlabel = '\nIncome x10 Million Dollars\n\n'
#plt.title('Left Title', loc='left')
axes[ax,col].title.set_text(titlestr.format(lbl, year))
axes[ax,col].set_ylabel('Count')
#divide the average salalry in the data between 1 million
meanmd = df.salary.values.mean() /1000000
axes[ax,col].set_xlabel(xlabel)
#put the histogram
axes[ax,col].hist(df.salary.values)
#draw a line for the average of the group
axes[ax,col].axvline(df.salary.values.mean(), color='c', linestyle='dashed',
linewidth=2, label='Avg ${0:.2f} MD'.format(meanmd))
#show legends
legend = axes[ax,col].legend(loc='upper center', shadow=True)
In [7]:
%pylab inline
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(9, 10), sharey=True,sharex=False,edgecolor='k')
#put the resulting DataFrames into lists to easily get the charts
ylist = ['2014','2015','2016']
alist = [college_salaries2014 , college_salaries2015, college_salaries2016]
blist = [non_college_salaries2014 , non_college_salaries2015, non_college_salaries2016]
#row counter
ax = 0
#string to put label on each chart
for ax in range( len(alist)):
#the figure has 2 columns, the column on the left has the colleger players data
# the column on the right has the non college players data
year = ylist[ax]
#college players salaries
adf = alist[ax]
put_salaries_hist(ax,0,adf,axes,'',year)
#non college players salaries
adf = blist[ax]
put_salaries_hist(ax,1,adf,axes,'Non ',year)
ax = ax + 1
fig.subplots_adjust(hspace = 1.1,wspace=0.8)
In the above graphics, we can see an increase in the average income,
of the group of player who attended to college, but in that same window of time,
those players who did not attend college had a better average income at first
and then in the year 2016, college players were better paid.
Another thing to notice is that in the group of players that did not attend to college
there are considerably more players with income of fewer than 5 million dollars.
Let's find out plotting the mean for both groups over the years. Unfortunately,
we can only plot this since the first year salaries were recorded.
In [9]:
#get the avegare salary by year
#year average wages for college player
clgmeans = college_player_salaries.groupby('yearID').salary.mean()
#year average wages for non college player
nclgmeans = non_college_player_salaries.groupby('yearID').salary.mean()
#year average wages for all player
allmean = salaries_df.groupby('yearID').salary.mean()
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(8, 7), sharex=False)
#Draw a plot chart to contain the three results
axes.title.set_text('Average salary for all playes from 1985 ot 2016')
axes.plot(clgmeans,label='College', marker='o')
axes.plot(nclgmeans, label='Non College')
axes.plot(allmean, label='All', marker='+')
axes.set_xlabel('Year')
axes.set_ylabel('Player income')
legend = axes.legend(loc='upper center', shadow=True)
The line in green represents players who did not attend college.
The blue line with dots is mean income of players who attended college.
The red line is the average income of all players.
Apparently going to college may not guaranty a better income,
until recent years those trends have been switching.
But something may be wrong with this analysis,
I have not taken into account the effect of the outliers and the size of the groups.
I cannot gurantee that attending to college may give players a better chance to get a greater income.
===================================================================================================================
In [10]:
def count_college_players(row):
"""
Counts the number of college players a team has had every year,
only counts player who attended college on, or before the year of
each record. This function is designed to be used with a DataFram.apply call
works on values of a row and returns the count of players
args:
row a row of the master_df Dataframe
return:
the count of college players for a team on one year
"""
teamID = row['teamID']
yearID = row['yearID']
#from the appearances table het the player from each team that year
yrteampls = appearances_df[appearances_df['yearID']==yearID]
yrteampls = yrteampls[ yrteampls['teamID']==teamID]
#now get the players that went to college on, or before that year
clgplyr = college_df[college_df['yearID']<=yearID]
#merge the filtered appearances and college table
merged = yrteampls.merge(clgplyr,
on=['playerID'],
how='inner')
#by counting the number of resulting records of the merging
#we can know how many college player had that team on that year
count = len(merged)
return count
In [11]:
#count the number of college players each team had each year
teams_df['college_player_count'] = teams_df.apply(count_college_players,axis=1)
#get two groups, winners of World Series and Non winners
winnners_clg_plyrs = teams_df[teams_df.WSWin=='Y']
not_winnners_clg_plyrs = teams_df[teams_df.WSWin=='N']
In [12]:
def put_college_winners_scatter(col, axes, df, title):
"""
Draws a scatter plots of the count of college player in the DataFrame
args:
col - Column of the subplot
axes - Axes array of the matplotlib figure
df - DataFre with the used data
"""
axes[col].title.set_text(title)
axes[col].scatter(df['yearID'].values,
df['college_player_count'].values)
axes[col].set_xlabel('Year')
axes[col].set_ylabel('Number of college players')
In [13]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True,sharex=False)
put_college_winners_scatter(0,axes,winnners_clg_plyrs,
'Winners of World Series and\nnumber of college players')
put_college_winners_scatter(1,axes,not_winnners_clg_plyrs,
'Non winners of World Series and\nnumber of college players')
The comparison of both charts (winner of World Series at the left chart),
may tell us that having more college player does not appear to be an advantage.
In the chart of the right (not winners by year), we appreciate a lot more college players
compared to the winning teams.
In [14]:
winnners_clg_plyrs_b1910 = winnners_clg_plyrs[winnners_clg_plyrs['yearID']<=1910]
not_winnners_clg_plyrs_b1910 = not_winnners_clg_plyrs[not_winnners_clg_plyrs['yearID']<=1910]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True,sharex=False)
put_college_winners_scatter(0,axes,winnners_clg_plyrs_b1910,
'Winners of World Series and\nnumber of college players\nbefore 1910')
put_college_winners_scatter(1,axes,not_winnners_clg_plyrs_b1910,
'Non winners of World Series and\nnumber of college players\nbefore 1910')
In [15]:
winnners_clg_plyrs_50_70 = winnners_clg_plyrs[winnners_clg_plyrs['yearID']>=1950]
winnners_clg_plyrs_50_70 = winnners_clg_plyrs_50_70[winnners_clg_plyrs_50_70['yearID']<=1970]
not_winnners_clg_plyrs_50_70 = not_winnners_clg_plyrs[not_winnners_clg_plyrs['yearID']>=1950]
not_winnners_clg_plyrs_50_70 = not_winnners_clg_plyrs_50_70[not_winnners_clg_plyrs_50_70['yearID']<=1970]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True,sharex=False)
put_college_winners_scatter(0,axes,winnners_clg_plyrs_50_70,
'Winners of World Series and\nnumber of college players\nbeetwen 1950 and 1970')
put_college_winners_scatter(1,axes,not_winnners_clg_plyrs_50_70,
'Non winners of World Series and\nnumber of college players\nbeetwen 1950 and 1970')
There are some interesting spots in the charts, there are gaps between
1890 and 1904, approximately, also there are lapses with some decrease
in the number of college players, most notably between 1950 and 1970.
I assume this was caused by historical events like the great depression,
World war I and the Vietnam War, in the latest large numbers of young men where deployed on war.
=================================================================================================
By this I focus on the number of games won up to the split season,
obviously excluding post season games. I want to discover if there is a
correlation between the number of college players in each team and the number of won games.
In [16]:
#the teams_df now has a column with the count of college players
teams_df = teams_df.sort_values(by='yearID', ascending=False)
In [17]:
teams_clgplr2016 = teams_df[teams_df['yearID']==2016]
x = teams_clgplr2016.W.values
y = teams_clgplr2016.college_player_count.values
fig, axScatter = plt.subplots(figsize=(5.5, 5.5))
axScatter.title.set_text('Games won by team and number of college players\nOn year 2016')
axScatter.scatter(x,y,label='a label')
axScatter.set_xlabel('Games Won')
axScatter.set_ylabel('Number of college players')
Out[17]:
In this case for year 2016, those teams with more college students
hava won more games.
Lets find out how does it look for the last 20 years
In [18]:
teams_clgplr20yrs = teams_df[teams_df['yearID']>=(2016-20)]
x = teams_clgplr20yrs.W.values
y = teams_clgplr20yrs.college_player_count.values
fig, axScatter = plt.subplots(figsize=(5.5, 5.5))
axScatter.title.set_text('Games won by team and number of college players\nFrom 1996 to 2016')
axScatter.scatter(x,y,label='a label')
axScatter.set_xlabel('Games Won')
axScatter.set_ylabel('Number of college players')
Out[18]:
At least from what can be appreciated in the scatter plot,
not all the teams with more college players have better performance.
Again from the last year, that tendency appears to be changing.
Independently of the results of these very simplistic analyses,
playing in college is a major entry point to the major league,
further analysis could demonstrate that some college graduated players
are responsible for the change in the tendency of the results.
A deep, consistent analysis may not be possible with all the data provided,
for instance, is not possible to study the income of the players before 1985.
Another issue is that inflation was not taken into account for the salaries section.
Using annual averages for CPI taken from
http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/
is possible to, roughly, convert salaries to today's values.
Please note that for the present year 2017, I only use the average CPI
In [19]:
#load the CPI data
filename = 'cpi_1985_2017.csv'
cpi_df = pd.read_csv(filename)
def get_year_cpi(year):
try:
return cpi_df[cpi_df['yearID']==year].cpi.values[0]
except:
return 0
current_cpi = get_year_cpi(2017)
def convert_wages(row):
"""
This function convert salaries to today's values
using the Customer Price Index
"""
year = row['yearID']
old_value = row['salary']
old_cpi = get_year_cpi(year)
today_value = ((old_value * (current_cpi/old_cpi)))
return today_value
In [20]:
salaries_df['todays_salary'] = salaries_df.apply(convert_wages, axis=1)
In [21]:
allmean_cpi_adj = salaries_df.groupby('yearID').todays_salary.mean()
clgmeans = salaries_df[salaries_df['went_to_college']].groupby('yearID').todays_salary.mean()
nclgmeans = salaries_df[salaries_df['went_to_college'] == False].groupby('yearID').todays_salary.mean()
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(8, 7), sharex=False)
#Draw a plot chart to contain the three results
axes.title.set_text('Average salary for all playes from 1985 ot 2016\n Adjusted for inflation')
axes.plot(clgmeans,label='College', marker='o')
axes.plot(nclgmeans, label='Non College')
axes.plot(allmean_cpi_adj, label='All', marker='+')
axes.set_xlabel('Year')
axes.set_ylabel('Player income')
legend = axes.legend(loc='upper center', shadow=True)
The chart has almos the same shape, but here, is more evident that
arround 1998 and 2000, the salaries were very close, then they started
to split and finally on 2016 some just contributed to a switch in the dendency.
http://www.usinflationcalculator.com/frequently-asked-questions-faqs/#HowInflationCalculatorWorks http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/ https://stackoverflow.com/questions/45969828/combine-two-pandas-dataframe-into-one-new https://stackoverflow.com/questions/12754584/matplotlib-two-datasets-on-one-image https://pandas.pydata.org/pandas-docs/stable/index.html https://matplotlib.org/2.0.2/users/examples_index.html
In [ ]: